Corpus Linguistics and the Automatic Analysis of English
Abstract
In a recent paper advocating a corpus-based and probabilistic approach to grammar development, Black, Lafferty, and Roukos (1992) argue that "the current state of the art is far from being able to produce a robust parser of general English" and advocate "steady and quantifiable," empirically corpus-driven grammar development and testing. Black et al. are addressing a community in which armchair introspection has been, and in many quarters still is, the dominant methodology, but in some parts of Europe corpus linguistics never died. For nearly two decades, the Nijmegen group led by Jan Aarts has been undertaking corpus analyses that, although motivated primarily by the desire to study language variation using corpus data, are particularly relevant to the issue of broad-coverage grammar development. In contrast to other groups undertaking corpus-based work (e.g., Garside, Leech, and Sampson 1987), the Nijmegen group has consistently adopted the position that it is possible and desirable to develop a formal, generative grammar that characterizes the syntactic properties of a given corpus and can be used to assign appropriate analyses to each of its sentences. Nelleke Oostdijk's book provides a detailed description of the cumulative development of a grammar capable of analyzing a one-million-word corpus of written English texts, drawn from a wide but balanced variety of sources. This task forms a significant component of the wider Tools for Syntactic Corpus Analysis (TOSCA) project being undertaken at Nijmegen. Oostdijk's work provides an excellent example of the strengths and weaknesses of the approach advocated by Black et al. In addition, she discusses issues such as sampling and tokenization of corpus material, as well as the exploitation of the analyzed corpus in studies of language variation.
However, in this review I will concentrate on the central core of her book: the development of the grammar and the performance of the associated parser, since this is the part that is most relevant to computational linguistics. Oostdijk begins by locating her work and the TOSCA project within the field of computational linguistics (arguing that it is distinguished by "an interest in language itself as it is actually produced" (p. 2)) and contrasting it with the LSP system (Sager 1981) and Parsifal (Marcus 1980). The comparison is brief and the choice odd, since more general broad-coverage grammars, such as DIAGRAM (Robinson 1982), PEG (Jensen et al. 1986), and ANLT (Grover et al. 1989), and more corpus-oriented parsing systems, such as FIDDITCH (Hindle 1983, 1993) or MITFP (de Marcken 1990), have been developed within the field but are not discussed anywhere. A similar suspicion of isolationism recurs in the sections dealing with the grammatical formalism used;
Publication date: 1991